Ubik: Efficient Cache Sharing with Strict QoS for Latency-Critical Workloads
Authors
Abstract
Chip-multiprocessors (CMPs) must often execute workload mixes with different performance requirements. On one hand, user-facing, latency-critical applications (e.g., web search) need low tail (i.e., worst-case) latencies, often in the millisecond range, and have inherently low utilization. On the other hand, compute-intensive batch applications (e.g., MapReduce) only need high long-term average performance. In current CMPs, latency-critical and batch applications cannot run concurrently due to interference on shared resources. Unfortunately, prior work on quality of service (QoS) in CMPs has focused on guaranteeing average performance, not tail latency. In this work, we analyze several latency-critical workloads, and show that guaranteeing average performance is insufficient to maintain low tail latency, because microarchitectural resources with state, such as caches or cores, exert inertia on instantaneous workload performance. Last-level caches impart the highest inertia, as workloads take tens of milliseconds to warm them up. When left unmanaged, or when managed with conventional QoS frameworks, shared last-level caches degrade tail latency significantly. Instead, we propose Ubik, a dynamic partitioning technique that predicts and exploits the transient behavior of latency-critical workloads to maintain their tail latency while maximizing the cache space available to batch applications. Using extensive simulations, we show that, while conventional QoS frameworks degrade tail latency by up to 2.3×, Ubik simultaneously maintains the tail latency of latency-critical workloads and significantly improves the performance of batch applications.
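To make the mechanism concrete, below is a minimal Python sketch of a transient-aware partitioning controller in the spirit of the abstract above. Every name and constant in it (the way counts, the `lc_tail_latency_ms` performance model, the boost policy) is an illustrative assumption, not Ubik's actual hardware mechanism.

```python
"""Toy sketch of transient-aware cache partitioning in the spirit of Ubik.

All names, constants, and the miss-curve model below are illustrative
assumptions, not Ubik's actual hardware mechanism.
"""

TOTAL_WAYS = 16        # ways in the shared last-level cache
TAIL_TARGET_MS = 5.0   # tail-latency target for the latency-critical (LC) app
BOOST_WAYS = 12        # generous allocation used while absorbing a transient

def lc_tail_latency_ms(ways: int) -> float:
    """Hypothetical performance model: more cache ways -> lower tail latency."""
    return 2.0 + 10.0 / ways

def steady_state_ways() -> int:
    """Smallest allocation that still meets the tail-latency target."""
    for ways in range(1, TOTAL_WAYS + 1):
        if lc_tail_latency_ms(ways) <= TAIL_TARGET_MS:
            return ways
    return TOTAL_WAYS

def partition(load_spike: bool, warming_up: bool) -> tuple[int, int]:
    """Return (lc_ways, batch_ways) for the next interval.

    While a load spike is being absorbed, or while the LC partition is
    still warming up after having been shrunk, the LC app gets a boosted
    allocation; otherwise the surplus goes to batch applications.
    """
    lc = BOOST_WAYS if (load_spike or warming_up) else steady_state_ways()
    return lc, TOTAL_WAYS - lc

if __name__ == "__main__":
    print("quiet period:", partition(load_spike=False, warming_up=False))  # (4, 12)
    print("load spike:  ", partition(load_spike=True, warming_up=False))   # (12, 4)
```

The asymmetry is the point of the sketch: in steady state the latency-critical application holds only the small allocation its tail-latency target requires, and the controller gives cache back ahead of transients instead of waiting for the partition to re-warm.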
Similar papers
The Vantage cache-partitioning technique enables configurability and quality-of-service guarantees in large-scale chip multiprocessors with shared caches. Caches can have hundreds of partitions with sizes specified at cache-line granularity, while maintaining high associativity and strict isolation among partitions
Shared caches are pervasive in chip multiprocessors (CMPs). In particular, CMPs almost always feature a large, fully shared last-level cache (LLC) to mitigate the high latency, high energy, and limited bandwidth of main memory. A shared LLC has several advantages over multiple, private LLCs: it increases cache utilization, accelerates intercore communication (which happens through the cac...
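As a rough intuition for line-granularity partitioning with strict isolation, the toy Python model below tracks a per-partition target and only evicts from partitions that exceed their targets. It is a software caricature under stated assumptions; the real Vantage design enforces this in hardware via a managed/unmanaged region and churn-based demotion, none of which is modeled here.

```python
"""Toy bookkeeping model of fine-grained cache partitioning.

Illustrative only: the real Vantage design enforces partition sizes in
hardware; this sketch just shows the line-granularity target/occupancy
invariant and isolation-by-eviction-policy.
"""

class PartitionedCache:
    def __init__(self, total_lines: int):
        self.total_lines = total_lines
        self.target = {}     # partition id -> target size, in cache lines
        self.occupancy = {}  # partition id -> current size, in cache lines

    def set_targets(self, targets: dict[int, int]) -> None:
        """Resize partitions; targets are specified at line granularity."""
        assert sum(targets.values()) <= self.total_lines
        self.target = dict(targets)
        self.occupancy = {p: self.occupancy.get(p, 0) for p in targets}

    def insert(self, part: int) -> None:
        """Bring one line in for `part`. If the cache is full, evict from
        the partition most over its target, so partitions below their
        targets are never victimized by others (strict isolation)."""
        if sum(self.occupancy.values()) >= self.total_lines:
            victim = max(self.target,
                         key=lambda p: self.occupancy[p] - self.target[p])
            self.occupancy[victim] -= 1
        self.occupancy[part] += 1

if __name__ == "__main__":
    llc = PartitionedCache(total_lines=1024)
    llc.set_targets({0: 768, 1: 256})  # e.g. LC app vs. batch app
    for _ in range(1024):
        llc.insert(0)                  # partition 0 fills the whole cache
    for _ in range(512):
        llc.insert(1)                  # batch churn reclaims only over-target lines
    print(llc.occupancy)               # partition 0 settles at/near its 768 target
```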
Full text

Hardware Support for Synchronized Shared Data on Multicore Processors
Multicore processors allow manufacturers to integrate larger numbers of simpler processing cores onto the same chip with few or no changes to the processing core architecture. These processors can simultaneously execute threads from separate processes (multiprogrammed workloads) or from the same multi-threaded application (parallel workloads). The design space for on-chip memory hierarchies inc...
Full text

Cache Design Options for a Clustered Multithreaded Architecture
The design of the memory hierarchy in a multi-core architecture is a critical component since it must meet the capacity (in terms of bandwidth and low latency) and coordination requirements of multiple threads of control. Most previous designs have assumed either a shared L1 data cache (e.g., simultaneous multithreaded architectures) or L1 caches that are private to each individual processor (e...
Full text

Elfen Scheduling: Fine-Grain Principled Borrowing from Latency-Critical Workloads Using Simultaneous Multithreading
Web services from search to games to stock trading impose strict Service Level Objectives (SLOs) on tail latency. Meeting these objectives is challenging because the computational demand of each request is highly variable and load is bursty. Consequently, many servers run at low utilization (10 to 45%); turn off simultaneous multithreading (SMT); and execute only a single service — wasting hard...
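A minimal sketch of the borrowing idea, assuming two ordinary Python threads and an event flag as stand-ins for the two SMT hardware contexts (the real Elfen system instruments the language runtime and pins threads to SMT lanes; all names here are hypothetical):

```python
"""Toy sketch of "principled borrowing" on an SMT lane, after Elfen's idea.

Two Python threads and an event flag stand in for the latency-critical
(LC) hardware context and the borrowing batch context.
"""
import threading
import time

lc_active = threading.Event()   # set while an LC request is in flight

def lc_server(requests: int) -> None:
    for _ in range(requests):
        lc_active.set()         # request arrives: claim the lane
        time.sleep(0.001)       # pretend to serve it for ~1 ms
        lc_active.clear()       # request done: lane is idle again
        time.sleep(0.004)       # low utilization between requests

def batch_worker(stop: threading.Event) -> None:
    done = 0
    while not stop.is_set():
        if lc_active.is_set():
            time.sleep(0)       # yield immediately: never compete with LC
            continue
        done += 1               # execute one small batch work quantum
    print(f"batch quanta completed while borrowing: {done}")

if __name__ == "__main__":
    stop = threading.Event()
    borrower = threading.Thread(target=batch_worker, args=(stop,))
    borrower.start()
    lc_server(requests=50)
    stop.set()
    borrower.join()
```

The batch thread only makes progress when the latency-critical lane is idle, which is what lets it harvest otherwise wasted cycles without inflating tail latency.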
Full text

Coherence Stalls or Latency Tolerance: Informed CPU Scheduling for Socket and Core Sharing
The efficiency of modern multiprogrammed multicore machines is heavily impacted by traffic due to data sharing and contention due to competition for shared resources. In this paper, we demonstrate the importance of identifying latency tolerance coupled with instruction-level parallelism on the benefits of colocating threads on the same socket or physical core for parallel efficiency. By adding h...
Full text